Abstract

Motivation: The presence of reticulate evolutionary events in phylogenies turns phylogenetic
trees into phylogenetic networks. These events imply in particular that there may exist
multiple evolutionary paths from a non-extant species to an extant one, and this multiplicity
makes the comparison of phylogenetic networks much more difficult than the comparison of
phylogenetic trees. In fact, all attempts to define a sound distance measure on the class of
all phylogenetic networks have failed so far. Thus, the only practical solutions have been
either the use of rough estimates of similarity (based on comparison of the trees embedded
in the networks), or narrowing the class of phylogenetic networks to a certain class where
such a distance is known and can be efficiently computed. The first approach has the problem
that one may identify two networks as equivalent when they are not; the second one
has the drawback that there may not exist algorithms to reconstruct such networks from
biological sequences.

Results: We present in this paper a distance measure on the class of tree-sibling time
consistent phylogenetic networks, which generalize tree-child time consistent phylogenetic
networks, and thus also galled-trees. The practical interest of this distance measure is
twofold: it can be computed in polynomial time by means of simple algorithms, and there
also exist polynomial-time algorithms for reconstructing networks of this class from DNA
sequence data.

Availability: The Perl package Bio::PhyloNetwork, included in the BioPerl bundle, implements
many algorithms on phylogenetic networks, including the computation of the distance
presented in this paper.
Contact: gabriel.cardona@uib.es



1 Introduction 

Phylogenies reveal the history of evolutionary events of a group of species, and they are central to
comparative analysis methods for testing hypotheses in evolutionary biology [15]. Although
phylogenetic trees have been used since the early days of phylogenetics [3] to represent evolutionary
histories under mutation, it is currently well known that the existence of genetic recombinations,
hybridizations and lateral gene transfers makes species evolve in a reticulate way rather than in a
simple, arborescent way [7].

Figure 1: Node v is quasi-sibling of u.

Now, as it happens in the case of phylogenetic trees, given a set of operational taxonomic 
units, different reconstruction algorithms, or different sets of sampled data, may lead to different 
reticulate evolutionary histories. Thus, a well-defined distance measure for phylogenetic networks 
becomes necessary. 

In a completely general setting, a phylogenetic network is simply a directed acyclic graph 
whose leaves (nodes without outgoing edges) are labeled by the species they represent jT5] Q15] . 
However, this situation is so general that even the problem of deciding when two such graphs are 
isomorphic is computationally hard. Hence, one has to put additional constraints to narrow down 
the class of phylogenetic networks. There have been different approaches to this problem in the 
literature, giving rise to different definitions of phylogenetic network; see p^l8jl^lT3 l lT6 l [T8| , ll9|. 

In this paper, we give a distance measure on the class of tree-sibling time consistent phylogenetic
networks. This class first appeared in Nakhleh's thesis [13], and it is of special interest
because there exist algorithms to reconstruct phylogenetic networks of this class from the anal- 
ysis of biological sequences \W \ 111 ] . However, all previous attempts to provide a sound distance 
measure on this class of networks have failed [6] . 

2 Tree-sibling time consistent phylogenetic networks 

Let $N = (V, E)$ be a directed acyclic graph, or DAG for short. We will say that a node u is a
tree node if $\mathrm{indeg}(u) \le 1$; moreover, if $\mathrm{indeg}(u) = 0$, we will say that it is a root of N. If a single
root exists, we will say that the DAG is rooted. We will say that a node u is a hybrid node if
$\mathrm{indeg}(u) \ge 2$. A node u is a leaf if $\mathrm{outdeg}(u) = 0$.

In a DAG $N = (V, E)$, we will say that v is a child of u if $(u, v) \in E$; in this case, we will
also say that u is a parent of v. Note that any tree node has a single parent, except for the roots 
of the graph. 

Whenever there exists a directed path (possibly trivial) from a node u to v, we will say
that v is a descendant of u, or that u is an ancestor of v.

We will say that two nodes u and v are siblings of each other if they share a parent. Note
that the relation of being siblings is reflexive and symmetric, but not transitive.

We will say that a tree node v is quasi-sibling of another tree node u if the parent of v is a
hybrid node that is also a sibling of u; see Fig. 1. The relation of being quasi-siblings is neither
reflexive nor symmetric.

A phylogenetic network on a set S of labels is a rooted DAG such that: 

• No tree node has out-degree 1. 

• Every hybrid node has out-degree 1, and its single child is a tree node.

• Its leaves are bijectively labeled by S. 

Moreover, if all hybrid nodes have in-degree equal to two, we will say that it is a semi-binary 
phylogenetic network. Note that semi-binarity does not impose any further condition on the 
out-degree of tree nodes. 

The underlying motivation for such definitions is that tree nodes represent species, the leaves
corresponding to extant ones, and the internal tree nodes to ancestral ones. Hybrid nodes model
recombination events, where the parents of a hybrid node correspond to the species involved in
this process, and its single child corresponds to the resulting species. Hence, the semi-binarity
condition means that these events always involve two, and only two, species.

Figure 2: A sbTSTC phylogenetic network.

1 Henceforth, in graphical representations of phylogenetic networks, hybrid nodes are represented by squares,
tree nodes by circles, and indeterminate nodes (that is, nodes that can be either tree or hybrid nodes) by both of them
superposed.

Although in real applications of phylogenetic networks, the set S labeling the leaves would 
correspond to a given set of taxa of extant species, for the sake of simplicity we will hereafter 
assume that the set of labels is simply $S = \{1, \dots, n\}$.

We will say that a phylogenetic network is tree-sibling if each hybrid node has at least one 
sibling that is a tree node. 

Biologically, this condition means that for each of the hybridization processes, at least one 
of the species involved in it has also some descendant through mutation. 
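The tree-sibling condition can be checked directly on an arc list. The following is a minimal sketch, assuming the network is given as a list of `(parent, child)` pairs; the function name `is_tree_sibling` and the two small example networks are illustrative and are not taken from the paper or from Bio::PhyloNetwork.

```python
from collections import defaultdict

def is_tree_sibling(arcs):
    """Return True if every hybrid node has at least one tree-node sibling."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for u, v in arcs:
        parents[v].add(u)
        children[u].add(v)

    def is_hybrid(v):
        # Hybrid nodes are exactly the nodes with in-degree at least 2.
        return len(parents[v]) >= 2

    for v in list(parents):
        if not is_hybrid(v):
            continue
        # Siblings of v: the other children of any parent of v.
        siblings = set()
        for p in parents[v]:
            siblings |= children[p] - {v}
        if all(is_hybrid(s) for s in siblings):
            return False  # no sibling at all, or only hybrid siblings
    return True

# Hypothetical example: both hybrid nodes A and B have a leaf as sibling.
GOOD = [("r", "u"), ("r", "v"), ("r", "w"), ("u", 1), ("u", "A"),
        ("v", "A"), ("v", "B"), ("w", "B"), ("w", 4), ("A", 2), ("B", 3)]
# Hypothetical counterexample: h1 and h2 only have each other as siblings.
BAD = [("r", "a"), ("r", "b"), ("a", "h1"), ("a", "h2"),
       ("b", "h1"), ("b", "h2"), ("h1", 1), ("h2", 2)]
```

The check only inspects in-degrees and shared parents, so it runs in time polynomial in the number of arcs.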

A time assignment on a network $N = (V, E)$ is a mapping $\tau : V \to \mathbb{N}$ such that:

1. $\tau(r) = 0$, where r is the root of N.

2. If v is a hybrid node and $(u, v) \in E$, then $\tau(u) = \tau(v)$.

3. If v is a tree node and $(u, v) \in E$, then $\tau(u) < \tau(v)$.

We will say that a network is time consistent if it admits a time assignment [2] . 

From a biological point of view, a time assignment represents the time when a certain species 
exists, or a certain hybridization process occurs. Note that whenever such a process takes place, 
the species involved must coexist; this is what the time-consistency property ensures. 
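One possible way to decide time consistency is sketched below; it is not the algorithm of the paper. It assumes a rooted DAG given as a list of `(parent, child)` arcs, merges every hybrid node with its parents into one class (condition 2), and then layers the classes along the arcs that enter tree nodes (condition 3). The name `time_assignment` and the example networks are hypothetical.

```python
from collections import defaultdict

def time_assignment(arcs):
    """Return a dict node -> time satisfying conditions 1-3, or None."""
    parents = defaultdict(set)
    nodes = set()
    for u, v in arcs:
        parents[v].add(u)
        nodes.update((u, v))

    # Union-find: a hybrid node must share its time with all its parents.
    rep = {x: x for x in nodes}

    def find(x):
        while rep[x] != x:
            rep[x] = rep[rep[x]]
            x = rep[x]
        return x

    for v in nodes:
        if len(parents[v]) >= 2:
            for u in parents[v]:
                rep[find(u)] = find(v)

    # Arcs entering tree nodes force a strict increase between classes.
    succ = defaultdict(set)
    indeg = defaultdict(int)
    for v in nodes:
        if len(parents[v]) == 1:
            (u,) = parents[v]
            cu, cv = find(u), find(v)
            if cu == cv:
                return None  # would require tau(u) < tau(u)
            if cv not in succ[cu]:
                succ[cu].add(cv)
                indeg[cv] += 1

    # Longest-path layering of the quotient graph (Kahn's algorithm).
    classes = {find(x) for x in nodes}
    level = {c: 0 for c in classes}
    queue = [c for c in classes if indeg[c] == 0]
    done = 0
    while queue:
        c = queue.pop()
        done += 1
        for d in succ[c]:
            level[d] = max(level[d], level[c] + 1)
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    if done != len(classes):
        return None  # a cycle among classes: no time assignment exists
    return {x: level[find(x)] for x in nodes}

# Hypothetical time consistent network with two hybrid nodes A and B.
OK = [("r", "u"), ("r", "v"), ("r", "w"), ("u", 1), ("u", "A"),
      ("v", "A"), ("v", "B"), ("w", "B"), ("w", 4), ("A", 2), ("B", 3)]
# Hypothetical network where hybrid h has a parent c that is a tree child
# of its other parent a, so tau(a) = tau(h) = tau(c) and tau(a) < tau(c).
NOT_OK = [("r", "a"), ("r", 3), ("a", "c"), ("a", "h"),
          ("c", 1), ("c", "h"), ("h", 2)]
```

When an assignment exists, the one returned is the smallest possible: the root gets time 0 and every other class gets the length of the longest chain of strict increases leading to it.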

By a sbTSTC network we will mean a semi-binary tree-sibling, time consistent phylogenetic 
network, and this will be the class of phylogenetic networks that we will consider in the rest of 
the paper. 

Remark. Besides the biological considerations we have made while presenting our assumptions
on phylogenetic networks, these are also motivated by the fact that we want to single out
phylogenetic networks by means of their $\mu$-representation (see Section 3 below). In Section 7
we give examples showing that the technical conditions imposed on phylogenetic networks are
necessary to achieve this goal.

Remark. We have mentioned in the introduction that the class of semi-binary tree-sibling time 
consistent phylogenetic networks generalizes those introduced in [TJ]. Namely, the latter are 
obtained from a phylogenetic tree by repeating the following procedure: 

1. choose a pair of arcs $(u_1, v_1)$ and $(u_2, v_2)$ in the tree;

2. split these arcs by introducing intermediate nodes $w_1$ (that will become a tree node) and
$w_2$ (that will become a hybrid node), respectively;

3. add a new arc $(w_1, w_2)$.

Each hybrid node introduced, $w_2$ in the notation above, has a tree sibling, namely $v_1$. Hence,
the networks obtained by this procedure are sbTSTC networks. However, the sbTSTC network
$N_3$ in Fig. 4 cannot be obtained by the procedure above from a tree T. Indeed, the described
procedure cannot introduce tree nodes with out-degree greater than 2; hence node a in $N_3$ should
also be a node of T, and the out-degree of r in T would be 1, yielding a contradiction.

The following result ensures the existence of sibling or quasi-sibling leaves in sbTSTC net- 
works. 

Lemma 1. Let N be a sbTSTC network. Then, there exists at least one pair of leaves that are
either siblings or quasi-siblings.

Table 1: Number of sbTSTC networks for small number n of leaves.

n                     1    2    3     4      5
Number of networks    1    1    10    606    215 283



Proof. Let M be the set of internal nodes of N with maximal time assignment. 

If no node of M is hybrid, let $u \in M$ be any tree node. Then, all its children are leaves:
indeed, if a child of u were an internal tree node, then its time assignment would be strictly 
greater than that of u, against our assumption; also, if a child of u were a hybrid node, then its 
time assignment would be the same as that of u, and hence M would contain a hybrid node. 
Therefore, since we do not allow out-degree 1 tree nodes, the node u has at least two children 
that are leaves, and these leaves are siblings. 

If M contains a hybrid node v, then its parents are tree nodes u, u' with the same time 
assignment as that of v, and at least one of them must have a tree child because of the tree- 
sibling property. Say that u has a tree child; the same argument as before proves that this child 
must be a leaf i. Moreover, the single child of v must be a tree node, hence also a leaf j. In this 
situation we have that j is a quasi-sibling of i. □ 

We now give tight bounds for the number of hybrid and internal tree nodes of a sbTSTC
phylogenetic network, depending on its number of leaves. The existence of such bounds implies,
in particular, that there exists a finite number of sbTSTC phylogenetic networks on a given set of
taxa up to isomorphisms. Nevertheless, we have not yet been able to find a closed expression for
this number of networks depending only on the number of leaves. Table 1 shows the experimental
results we have found in this direction using the procedure described in Section 5.

Proposition 2. Let N be a sbTSTC network. Let n, h, t be, respectively, the number of leaves,
the number of hybrid nodes and the number of internal tree nodes of N. If $n \le 2$, then $h = 0$
and $t = n - 1$. Otherwise, $h \le 2n - 4$ and $t \le 3n - 6$.

Proof. The result is obvious if $n \le 2$, since then N is a tree.

Assume that $n \ge 3$ and that the result is proved for networks with fewer than n leaves. Let M
be the set of internal nodes with maximum time assignment, and let $M_t$ (respectively, $M_h$) be
the set of tree nodes (respectively, hybrid nodes) in M. Notice that $M_t$ is non-empty, because
if a hybrid node has maximum time assignment, its two parents have the same time assignment
and, therefore, are in $M_t$. Consider the following different situations:

1. If some node u in $M_t$ has two (or more) children leaves, let $N'$ be the sbTSTC network
obtained by removing one of these leaves and, if necessary, collapsing the created elementary
path into a single arc. Then the number of leaves, hybrid nodes and internal tree nodes in
$N'$ is

$$n' = n - 1, \qquad h' = h, \qquad t' = t - \epsilon,$$

with $\epsilon = 0$ if the out-degree of u in N is greater than two, and $\epsilon = 1$ otherwise. Now, from
the induction hypothesis we get

$$h = h' \le 2n' - 4 = 2n - 2 - 4 < 2n - 4,$$

$$t = t' + \epsilon \le 3n' - 6 + \epsilon = 3n - 9 + \epsilon < 3n - 6.$$

2. If (1) does not hold, but every node in $M_t$ has one child leaf, let $N'$ be the sbTSTC network
obtained by removing all the nodes in $M_h$, together with their respective children leaves
(say $k = |M_h|$), and collapsing the created elementary paths into single arcs. In this case
we have that

$$n' = n - k, \qquad h' = h - k, \qquad t' = t - k',$$

where $k' \le 2k$ is the number of elementary paths that have been removed. Now, also from
the induction hypothesis we get

$$h = h' + k \le 2n' - 4 + k = 2n - 2k - 4 + k = 2n - 4 - k < 2n - 4,$$

$$t = t' + k' \le 3n' - 6 + k' = 3n - 3k - 6 + k' < 3n - 6.$$

3. If neither (1) nor (2) hold, then there exists a node $u \in M_t$ such that all its children, say
$v_1, \dots, v_k$ ($k \ge 2$), are in $M_h$. Let $N'$ be the sbTSTC network obtained by removing all
nodes $v_1, \dots, v_k$ together with their respective children leaves, and collapsing the created
elementary paths into single arcs. Notice that the node u is no longer an internal tree
node, but a leaf of $N'$. Then, the number of nodes in $N'$ is

$$n' = n - k + 1, \qquad h' = h - k, \qquad t' = t - k' - 1,$$

where $k' \le k$ is the number of elementary paths that have been removed. Now, the induction
hypothesis yields

$$h = h' + k \le 2n' - 4 + k = 2n - 2k + 2 - 4 + k = 2n - k - 2 \le 2n - 4,$$

$$t = t' + k' + 1 \le 3n' - 6 + k' + 1 = 3n - 2 - 3k + k' \le 3n - 2 - 2k \le 3n - 6.$$

Hence, in all cases, the result follows. 

□ 

The bounds in the proposition above are tight, as the following example shows. 

Example 1. Consider the family of sbTSTC phylogenetic networks $(N_n)_{n \ge 3}$ defined recursively
in the following way:

• $N_3$ is the first phylogenetic network depicted in Fig. 4.

• The network $N_{n+1}$ is obtained from $N_n$ by applying the transformation described in Fig. 3.

Fig. 4 also depicts $N_4$ and $N_5$, where we label the internal nodes in these networks to ease
understanding of the construction.

Note that all networks $N_n$ are semi-binary and tree-sibling by construction. Also, the time
consistency property can be easily verified: when constructing $N_{n+1}$ from $N_n$, we can assign to
each of the internal nodes introduced the maximum of the times that the leaves 1, 2, n have in
$N_n$, and reassign to the leaves 1, 2, n, n + 1 this maximum plus one.




Figure 3: The transformation that produces $N_{n+1}$ from $N_n$.

Now, $N_3$ has 3 internal tree nodes and 2 hybrid nodes, and the construction of $N_{n+1}$ from
$N_n$ adds 3 internal tree nodes and 2 hybrid nodes. It is evident, then, that each $N_n$ has $3(n - 2)$
internal tree nodes and $2(n - 2)$ hybrid nodes.

3 The mu-representation 

In [3] we introduced the $\mu$-representation for a different class of phylogenetic networks, the
so-called tree-child phylogenetic networks, those networks where every internal node has at least
one child that is a tree node. We remark that the tree-child condition is more restrictive than
the tree-sibling one; nevertheless, the additional condition of time consistency that we use here
means that neither of the two classes is contained in the other.

In this section we review the definition of the $\mu$-representation of phylogenetic networks, and
we will prove later that this representation characterizes a sbTSTC phylogenetic network, up to
isomorphism.

Let $N = (V, E)$ be a phylogenetic network on the set $S = \{1, \dots, n\}$. For each node u of N,
we consider its $\mu$-vector,

$$\mu(u) = (m_1(u), \dots, m_n(u)),$$

where $m_i(u)$ is the number of different paths from u to the leaf i. Moreover, we define the
$\mu$-representation of N, $\mu(N)$, as the multiset

$$\mu(N) = \{\mu(u) \mid u \in V\},$$

with each element appearing as many times as the number of different nodes having it as its
$\mu$-vector.

Figure 4: Maximal sbTSTC phylogenetic networks with 3, 4, and 5 leaves.

Table 2: $\mu$-representation of the network in Fig. 2.

node    $\mu$-vector
r       (1,2,2,1)
u       (1,1,0,0)
v       (0,1,1,0)
w       (0,0,1,1)
A       (0,1,0,0)
B       (0,0,1,0)

For each leaf i, we have that its $\mu$-vector is $\mu(i) = \delta(i)$, with $\delta(i)$ the vector with 0 at
each position, except at its i-th position, where it is 1. As for the other nodes, we have that
$\mu(u) = \sum_k \mu(v_k)$, where the sum ranges over the set $\{v_k\}$ of children of u [U Lemma 4]. This
property allows for the computation of $\mu(N)$ in polynomial time (see Section 6 below).
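The recursion over children translates into a short memoized computation. The sketch below assumes the network is a list of `(parent, child)` arcs whose leaves are the integers 1, ..., n. The function name `mu_representation` is illustrative, and the arc list `EXAMPLE` is only one network whose $\mu$-vectors agree with Table 2; since the figure is not reproduced here, its exact shape is an assumption.

```python
from collections import Counter, defaultdict
from functools import lru_cache

def mu_representation(arcs, n):
    """Return (vectors, multiset): vectors maps each node to its mu-vector
    (a tuple of length n); multiset is a Counter of those vectors, mu(N)."""
    children = defaultdict(list)
    nodes = set()
    for u, v in arcs:
        children[u].append(v)
        nodes.update((u, v))

    @lru_cache(maxsize=None)
    def mu(u):
        if not children[u]:
            # Leaf labelled u in 1..n: the unit vector delta(u).
            return tuple(int(k == u) for k in range(1, n + 1))
        # Internal node: componentwise sum of the children's mu-vectors.
        vecs = [mu(v) for v in children[u]]
        return tuple(sum(col) for col in zip(*vecs))

    vectors = {u: mu(u) for u in nodes}
    return vectors, Counter(vectors.values())

# Assumed arc list, chosen to be consistent with the vectors of Table 2.
EXAMPLE = [("r", "u"), ("r", "v"), ("r", "w"), ("u", 1), ("u", "A"),
           ("v", "A"), ("v", "B"), ("w", "B"), ("w", 4), ("A", 2), ("B", 3)]

vectors, mu_N = mu_representation(EXAMPLE, 4)
print(vectors["r"])  # (1, 2, 2, 1)
```

Each node is visited once and each arc contributes one vector addition, which is where the polynomial bound comes from.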

Example 2. Consider the sbTSTC phylogenetic network in Fig. 2. In Table 2 we give its
$\mu$-representation, except for the leaves, whose $\mu$-vector is trivial.

In the next section we will introduce a set of decomposition/reconstruction procedures for
sbTSTC phylogenetic networks. It will turn out that the application conditions for these
procedures can be read from the $\mu$-representation of the network.

Lemma 3. Let N be a sbTSTC phylogenetic network, i, j a pair of leaves, and let u be the parent
of i. Then j is sibling or quasi-sibling of i if, and only if:

1. $\mu(u)$ is minimal in the set

$$M = \{\mu \in \mu(N) \mid \mu \ge \delta(i) + \delta(j)\}.$$

2. The multiset

$$M_i = \{\mu \in \mu(N) \mid \mu(u) > \mu \ge \delta(i)\}$$

is equal to $\{\delta(i)\}$.

3. The multiset

$$M_j = \{\mu \in \mu(N) \mid \mu(u) > \mu \ge \delta(j)\}$$

is equal to $\{\delta(j)\}$ (when j is sibling of i) or to $\{\delta(j), \delta(j)\}$ (when j is quasi-sibling of i).

Proof. Let us assume that j is sibling or quasi-sibling of i. In either case, both i and j are
descendants of u, so that $\mu(u) \in M$. Now, for any other node w with $\mu(w) \in M$, we have that
$w \ne i$ and w is an ancestor of i, hence it is also an ancestor of u, and therefore $\mu(w) \ge \mu(u)$;
hence, $\mu(u)$ is minimal in M. Moreover, the only $\mu$-vector in $M_i$ is $\delta(i)$, with multiplicity 1,
because the only ancestor of i that is a non-trivial descendant of u is the leaf i itself. The
situation for $M_j$ is analogous, taking into account that $M_j$ contains a second copy of $\delta(j)$ in the
case that the parent of j is hybrid.

As for the converse, let us assume that for a node w, its $\mu$-vector is minimal in M. Note
that, since a hybrid node and its single child (a tree node) have the same $\mu$-vector, we can
assume that w is a tree node. Because of the definition of M, we have that w is an ancestor of
both i and j. Now, if some child v of w were an ancestor of both i and j, we would have that
$\mu(w) > \mu(v) \ge \delta(i) + \delta(j)$, against our assumption on the minimality of $\mu(w)$ in M. Therefore,
w has two children $v_i, v_j$ such that $v_i$ is ancestor of i (but not of j) and $v_j$ is ancestor of j (but
not of i). Then, $\mu(v_i) \in M_i$ and, by the uniqueness of the element in $M_i$, we have that $v_i = i$,
and it follows that w is the parent of i, that is, $w = u$. Symmetrically, we have that $\mu(v_j) \in M_j$.
Now, two situations may arise: first, if the multiplicity of $\delta(j)$ in $M_j$ is one, then $v_j = j$ and j
is a sibling of i; second, if this multiplicity is two, then $v_j$ must be a hybrid node whose single
child is j, hence j is quasi-sibling of i.

□ 

Lemma 4. Let N be a sbTSTC phylogenetic network. Let j be a leaf sibling or quasi-sibling of
another leaf i, and let u be the parent of i. Then, $\mathrm{outdeg}(u) = 2$ if, and only if, $\mu(u) = \delta(i) + \delta(j)$.

Proof. Note that with the assumptions made, and by the previous lemma, we have that
$\mu(u) \ge \delta(i) + \delta(j)$. Now, the equality holds if, and only if, u has no other children apart from i and j
(in case that j is sibling of i) or the hybrid parent of j (in case that j is quasi-sibling of i).

□ 

For future reference, we gather these last results into the following proposition. 

Proposition 5. Let N be a sbTSTC phylogenetic network. The following properties can be
decided from the knowledge of $\mu(N)$:

1. Two leaves are siblings, or not. 

2. A leaf is quasi-sibling of another one, or not. 

3. A leaf is sibling or quasi-sibling of another leaf, and the parent of the latter has out-degree 
2, or greater than 2. 
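As an illustration of how these properties might be read off $\mu(N)$, the sketch below applies one reading of the conditions of Lemmas 3 and 4: it tries every minimal vector of M as the candidate $\mu(u)$ and compares the multisets $M_i$ and $M_j$ with the expected ones. The multiset is assumed to be a `Counter` of tuples; the names `relation`, `leq` and `delta` are hypothetical, and the example multiset is the one of Table 2 with the four leaf vectors added.

```python
from collections import Counter

def leq(a, b):
    """Componentwise order on mu-vectors."""
    return all(x <= y for x, y in zip(a, b))

def delta(i, n):
    return tuple(int(k == i) for k in range(1, n + 1))

def relation(mu_N, i, j, n):
    """Classify leaf j relative to leaf i using only the multiset mu_N.
    Returns (kind, parent_has_outdegree_2) with kind "sibling" or
    "quasi-sibling", or None when j is neither."""
    di, dj = delta(i, n), delta(j, n)
    both = tuple(a + b for a, b in zip(di, dj))
    M = [m for m in mu_N if leq(both, m)]
    minimal = [m for m in M if not any(o != m and leq(o, m) for o in M)]
    for mu_u in minimal:
        def strictly_between(low):
            # Multiset {mu in mu_N | mu_u > mu >= low}.
            return Counter({m: c for m, c in mu_N.items()
                            if m != mu_u and leq(m, mu_u) and leq(low, m)})
        if strictly_between(di) != Counter({di: 1}):
            continue
        Mj = strictly_between(dj)
        if Mj == Counter({dj: 1}):
            return "sibling", mu_u == both      # Lemma 4 for the out-degree
        if Mj == Counter({dj: 2}):
            return "quasi-sibling", mu_u == both
    return None

# mu(N) from Table 2, with the leaf vectors delta(1), ..., delta(4) added.
TABLE2 = Counter([(1, 2, 2, 1), (1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1),
                  (0, 1, 0, 0), (0, 0, 1, 0),
                  (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)])

print(relation(TABLE2, 1, 2, 4))  # ('quasi-sibling', True)
```

Note that the relation is not symmetric: with the same multiset, `relation(TABLE2, 2, 1, 4)` returns `None`, because the parent of leaf 2 is a hybrid node.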

4 The reduction procedures 

We now introduce four reduction procedures that decrease either the number of leaves or of 
hybrid nodes in a sbTSTC phylogenetic network. 

The T reduction. 

Let N be a sbTSTC phylogenetic network on S, i, j two sibling leaves, u their common parent,
and assume that $\mathrm{outdeg}(u) > 2$. The DAG $N_{T(i,j)}$ is obtained by removing from N the leaf j
and its incoming arc; see Fig. 5.

It is easy to check that the obtained DAG is a sbTSTC phylogenetic network on $S \setminus \{j\}$.
Indeed, if the removed node j were a sibling of some hybrid node x, then i would still be a
tree node sibling of x in $N_{T(i,j)}$, hence the tree-sibling condition is preserved. Also, the time
consistency and semi-binarity conditions are trivially preserved.

Figure 5: The T reduction.

Figure 6: The TR reduction.



Note that, given $N_{T(i,j)}$, we can reconstruct N, up to isomorphism, by simply adding the
leaf j and an arc from the parent of i to j.

Note also that the $\mu$-representation of $N_{T(i,j)}$ can be easily obtained from that of N. Indeed,
for any node u (except for the deleted leaf, which implies removing $\delta(j)$ from $\mu(N)$) we have
that its $\mu$-vector in the reduced network is the same as in the original network but with the
j-th component removed.

The TR reduction. 

Let N be a sbTSTC phylogenetic network on S, i, j two sibling leaves, u their common parent,
and assume that $\mathrm{outdeg}(u) = 2$. Suppose also that N is not a tree with two leaves, which is
equivalent to u not being the root of N. The DAG $N_{TR(i,j)}$ is obtained by removing from
N the leaf j and its incoming arc, and collapsing the created elementary path into a single arc;
see Fig. 6.

As in the previous case, the resulting network is a sbTSTC phylogenetic network on $S \setminus \{j\}$.
Indeed, if the node u in N is sibling of a hybrid node w, then in the obtained network $N_{TR(i,j)}$
the leaf i is a sibling of w.

Analogously to the previous case, given $N_{TR(i,j)}$, we can reconstruct N up to isomorphism
by simply adding the leaf j, splitting the arc with head i by introducing an intermediate node
u, and adding an arc from u to j.

Moreover, the $\mu$-representation of $N_{TR(i,j)}$ can be easily obtained from that of N. The
procedure is analogous to the previous case, taking into account that we also have to remove
from $\mu(N)$ a node with $\mu$-vector equal to $\delta(i) + \delta(j)$.
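The effect of the T and TR reductions on the $\mu$-representation can be sketched as follows, assuming $\mu(N)$ is stored as a `Counter` of tuples. The function name `reduce_T` and its `collapse_parent` flag are illustrative choices; the function does not check that the reduction is actually applicable.

```python
from collections import Counter

def delta(i, n):
    return tuple(int(k == i) for k in range(1, n + 1))

def reduce_T(mu_N, i, j, n, collapse_parent=False):
    """mu-representation after removing the sibling leaf j of i.
    collapse_parent=False models the T reduction, True models TR."""
    rest = Counter(mu_N)
    rest[delta(j, n)] -= 1                      # the deleted leaf j
    if collapse_parent:                         # TR: the parent u disappears too
        both = tuple(a + b for a, b in zip(delta(i, n), delta(j, n)))
        rest[both] -= 1
    reduced = Counter()
    for vec, count in rest.items():
        if count > 0:
            reduced[vec[:j - 1] + vec[j:]] += count   # drop component j
    return reduced

# Star tree with leaves 1, 2, 3: the root has out-degree 3, so T applies.
star = Counter([(1, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1)])
# Tree ((1,2),3): the parent of 1 and 2 has out-degree 2, so TR applies.
cherry = Counter([(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)])

print(reduce_T(star, 1, 2, 3) == reduce_T(cherry, 1, 2, 3, collapse_parent=True))  # True
```

Both calls yield the $\mu$-representation of the tree with two leaves, with the remaining labels occupying the positions left after dropping component j.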

The H reduction. 

Let N be a sbTSTC phylogenetic network on S, j a leaf quasi-sibling of another leaf i, u the
parent of i, v the parent of j, and assume that $\mathrm{outdeg}(u) > 2$. The DAG $N_{H(i,j)}$ is obtained by
removing from N the arc (u, v) and collapsing the resulting elementary path with intermediate
node v into a single arc; see Fig. 7.

Since we have only removed a hybrid node of N when collapsing the elementary path, it is
straightforward to check that the obtained DAG is a sbTSTC phylogenetic network on S.

Now, given $N_{H(i,j)}$, we can reconstruct N up to isomorphism by simply splitting the arc with
head j by introducing an intermediate node v, and adding an arc from the parent of i to v.

Note that the $\mu$-representation of $N_{H(i,j)}$ can be easily obtained from that of N. Namely,
for every node x other than the leaf i, whose $\mu$-vector $\delta(i)$ does not change (and except for the
removed hybrid node, which implies removing one copy of $\delta(j)$ from $\mu(N)$), we have that if
$\mu_N(x) = (m_1(x), \dots, m_n(x))$, then $\mu_{N_{H(i,j)}}(x) = (m'_1(x), \dots, m'_n(x))$ with

$$m'_k(x) = \begin{cases} m_k(x) & \text{if } k \ne j, \\ m_j(x) - m_i(x) & \text{if } k = j. \end{cases}$$

This follows from the fact that we have only removed the paths $x \leadsto j$ that pass through the
parent of i, which are in bijection with the paths $x \leadsto i$.

Figure 7: The H reduction.

Figure 8: The HR reduction.

The HR reduction. 

Let N be a sbTSTC phylogenetic network on S, j a leaf quasi-sibling of another leaf i, u the
parent of i, v the parent of j, and assume that $\mathrm{outdeg}(u) = 2$. The DAG $N_{HR(i,j)}$ is obtained
by removing from N the arc (u, v) and collapsing the created elementary paths with respective
intermediate nodes u and v into single arcs; see Fig. 8.

The fact that the obtained DAG is a sbTSTC phylogenetic network on S follows as in the
previous cases.

Also, given $N_{HR(i,j)}$, we can reconstruct N by simply splitting the arcs with respective heads
i, j by introducing intermediate nodes u, v, and adding an arc from u to v.

Moreover, the $\mu$-representation of $N_{HR(i,j)}$ can also be obtained from that of N. The
procedure is the same as in the last case, taking into account that we also have to remove from $\mu(N)$
a node with $\mu$-vector equal to $\delta(i) + \delta(j)$.
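The update of the $\mu$-representation under the H and HR reductions can be sketched in the same style, again assuming $\mu(N)$ is a `Counter` of tuples. The name `reduce_H` and the flag `collapse_parent` are hypothetical. The vector $\delta(i)$ of the leaf i is left untouched, since its only path to i is the trivial one and does not pass through the parent of i; applicability of the reduction is not checked.

```python
from collections import Counter

def delta(i, n):
    return tuple(int(k == i) for k in range(1, n + 1))

def reduce_H(mu_N, i, j, n, collapse_parent=False):
    """mu-representation after removing the hybrid parent of j, where j is a
    quasi-sibling of i. collapse_parent=False models H, True models HR."""
    rest = Counter(mu_N)
    rest[delta(j, n)] -= 1                      # the removed hybrid parent of j
    if collapse_parent:                         # HR: the parent u of i disappears too
        both = tuple(a + b for a, b in zip(delta(i, n), delta(j, n)))
        rest[both] -= 1
    reduced = Counter()
    for vec, count in rest.items():
        if count <= 0:
            continue
        if vec != delta(i, n):                  # the leaf i keeps its vector
            new = list(vec)
            new[j - 1] = vec[j - 1] - vec[i - 1]    # m'_j = m_j - m_i
            vec = tuple(new)
        reduced[vec] += count
    return reduced

# mu(N) from Table 2 with leaf vectors added; there outdeg(u) = 2, so HR(1,2).
TABLE2 = Counter([(1, 2, 2, 1), (1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1),
                  (0, 1, 0, 0), (0, 0, 1, 0),
                  (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)])

after = reduce_H(TABLE2, 1, 2, 4, collapse_parent=True)
print(after[(1, 1, 2, 1)])  # 1, the new mu-vector of the root
```

In this example the root loses exactly the one path to leaf 2 that went through the parent of leaf 1, so its vector changes from (1,2,2,1) to (1,1,2,1).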

Example 3. In Fig. 9 we show a sequence of reduction processes that, applied to the network in
Fig. reduce it to a tree with two leaves. 

Remark. The construction given in Example 1 for the networks with maximal number of nodes
can also be described in terms of the reductions (or rather their inverses) we have defined. Indeed,
$N_{n+1}$ can also be described as the network obtained from $N_n$ by application of the inverses of
the reductions TR(2, n + 1), HR(1, 2), and HR(n, n + 1) (in this order).

5 The mu-distance 

For any pair of phylogenetic networks $N_1$, $N_2$ on the same set of leaves, let

$$d_\mu(N_1, N_2) = |\mu(N_1) \,\triangle\, \mu(N_2)|,$$

where both the symmetric difference and the cardinality operator refer to multisets.

Our main result in this paper is that this mapping gives a distance on the class of sbTSTC
phylogenetic networks on a given set S of taxa. We remark that $d_\mu$ is also a distance on the
set of tree-child phylogenetic networks on S and, in particular, on phylogenetic trees, where it
coincides with the Robinson-Foulds distance [2].

Theorem 6. Let $N_1$, $N_2$, $N_3$ be sbTSTC phylogenetic networks on the same set of taxa. Then:

1. $d_\mu(N_1, N_2) \ge 0$,

2. $d_\mu(N_1, N_2) = 0$ if, and only if, $N_1 \cong N_2$,

3. $d_\mu(N_1, N_2) = d_\mu(N_2, N_1)$,

4. $d_\mu(N_1, N_3) \le d_\mu(N_1, N_2) + d_\mu(N_2, N_3)$.

Proof. Except for the second statement, the result follows from the properties of the symmetric
difference of multisets.

Also, if $N_1$ and $N_2$ are isomorphic, it follows from the definition of the $\mu$-representation that
$\mu(N_1)$ and $\mu(N_2)$ are equal as multisets.

We will prove the separation property ($d_\mu(N_1, N_2) = 0$ implies that $N_1 \cong N_2$) by induction
on the number n of leaves and the number h of hybrid nodes.

If $n \le 2$, which implies that $h = 0$, the result is obvious, since there exist only two such
sbTSTC phylogenetic networks, namely the rooted trees with 1 and 2 leaves. Also, when $h = 0$,
the networks are, in fact, trees and the separation property of the Robinson-Foulds distance
implies that $N_1 \cong N_2$.

Let us assume that the result is proved for sbTSTC networks with at most $n - 1 \ge 2$ leaves,
and with n leaves and at most $h - 1 \ge 0$ hybrid nodes. Let $N_1$, $N_2$ be sbTSTC phylogenetic
networks with n leaves and h hybrid nodes. Because of Lemma 1 there exists a pair of leaves i, j
such that j is a sibling of i (respectively, j is quasi-sibling of i) in $N_1$. Now since $\mu(N_1) = \mu(N_2)$,
we can apply Proposition 5 to get that j is also a sibling (respectively, quasi-sibling) of i in
$N_2$. Moreover, also from Proposition 5 it follows that the out-degree of the parent of i in $N_1$
is equal to 2 if, and only if, the out-degree of the parent of i in $N_2$ is equal to 2. From this, it
follows that we can apply the same reduction to both networks; let $N'_1$, $N'_2$ be the networks obtained
from $N_1$, $N_2$ using this reduction. Since the $\mu$-representation of the reductions depends only on
the $\mu$-representation of the original network and the reduction procedure applied, we get that
$\mu(N'_1) = \mu(N'_2)$. Since now $N'_1$ and $N'_2$ have fewer leaves or hybrid nodes than $N_1$ and $N_2$, it follows
from the induction hypothesis that $N'_1 \cong N'_2$. Finally, since we can recover up to isomorphisms
the original networks from their reduced networks and the reductions applied, we conclude that
$N_1 \cong N_2$.

□ 
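Once the two $\mu$-representations are available, the distance itself is a short multiset computation. The sketch below assumes each $\mu$-representation is a `Counter` of tuples; the function name `mu_distance` is illustrative and this is not the Bio::PhyloNetwork implementation.

```python
from collections import Counter

def mu_distance(mu_1, mu_2):
    """Cardinality of the symmetric difference of two multisets (Counters)."""
    # Counter subtraction keeps only positive counts, so each term collects
    # the elements that are in excess on one side.
    diff = (mu_1 - mu_2) + (mu_2 - mu_1)
    return sum(diff.values())

# Two trees on leaves 1, 2, 3: ((1,2),3) and ((1,3),2), leaves included.
t1 = Counter([(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)])
t2 = Counter([(1, 1, 1), (1, 0, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1)])

print(mu_distance(t1, t2))  # 2
```

The leaf vectors always cancel, so only the internal nodes contribute; for these two trees the clusters {1,2} and {1,3} differ.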

The tight bounds found in Section 2 for the number of internal nodes in a sbTSTC phylogenetic
network allow us to find the diameter of this class of phylogenetic networks with respect
to the $\mu$-distance, that is, the maximum of the distances between two networks in this class.
The interest of having a closed expression for the diameter is that it allows us to normalize the
$\mu$-distance so that it takes values in the unit interval [0, 1] of real numbers.

Proposition 7. The diameter of the class of sbTSTC phylogenetic networks with respect to
$d_\mu$ is 0 when $n \le 2$, 9 when $n = 3$, and $10(n - 2)$ when $n \ge 4$.

Proof. The assertion for $n \le 2$ is straightforward: there is only one sbTSTC phylogenetic network
with one leaf and one sbTSTC phylogenetic network with two leaves. As far as the assertion for
$n = 3$ goes, it can be easily checked by means of the direct computation of all pairs of distances:
the largest distance is 9, and it is reached (up to permutations of labels) only by the pair of
networks depicted in Fig. 10.

Finally, in the case $n \ge 4$, we know that a sbTSTC phylogenetic network with n leaves has
at most $3(n - 2)$ internal tree nodes and $2(n - 2)$ hybrid nodes, which gives an upper bound of
$5(n - 2)$ for the total number of internal nodes. Now, the $\mu$-vector of the leaf i is the same in any
sbTSTC phylogenetic network, and therefore the $\mu$-distance between two sbTSTC phylogenetic
networks is upper bounded by the sum of their numbers of internal nodes.

Figure 10: A pair of sbTSTC phylogenetic networks with 3 leaves at maximum $\mu$-distance.

Combining these two upper bounds, we have that, for every pair of sbTSTC phylogenetic
networks with n leaves N and $N'$,

$$d_\mu(N, N') \le 2 \cdot 5(n - 2) = 10(n - 2).$$

It remains to display a pair of sbTSTC phylogenetic networks with n leaves whose $\mu$-distance
reaches this bound. Such a pair must consist of two sbTSTC phylogenetic networks with
$3(n - 2)$ internal tree nodes and $2(n - 2)$ hybrid nodes each, and with disjoint sets of $\mu$-vectors
of internal nodes.

One such pair is given by the network $N_n$ described in Example 1 and the network $N'_n$
obtained from $N_n$ by interchanging on the one hand the labels 1 and n and on the other hand
the labels 2 and 3. Fig. 11 depicts $N_5$ side by side with $N'_5$ to make the differences between
these networks easy to spot.




Figure 11: Two sbTSTC phylogenetic networks with 5 leaves at maximum $\mu$-distance.

To prove that N n and have disjoint sets of /^-vectors of internal nodes, let us start by 
studying the clusters (that is, the sets of descendant leaves) of their internal nodes. We shall 
denote the cluster of a node v in a network N by Cn(v), and we shall say that such a cluster is 
internal when v is internal. Note that if two nodes have different clusters, then they must have 
different /j,- vectors. 

The construction of $N_n$ from $N_{n-1}$ changes its set of internal clusters in the following way.
On the one hand, every internal node $v$ of $N_{n-1}$ survives in $N_n$ and its cluster is modified
as follows:

• If $1 \in C_{N_{n-1}}(v)$, then 2 is added to $C_{N_n}(v)$.

• If $2 \in C_{N_{n-1}}(v)$, then $n$ is added to $C_{N_n}(v)$.

• If $n-1 \in C_{N_{n-1}}(v)$, then $n$ is added to $C_{N_n}(v)$.

• No other leaf is added to any cluster of an internal node.

On the other hand, this construction adds five new internal nodes with clusters

$$\{1,2\},\ \{2\},\ \{2,n\},\ \{n\},\ \{n-1,n\}.$$

Starting with the family of internal clusters of $N_3$ and using these rules, it is easy to prove by
induction that the family of internal clusters of $N_n$ is (up to repetitions)

$$\begin{array}{l}
\{1,2,3,4,\dots,n\},\ \{2,3,4,\dots,n\},\ \{3,4,\dots,n\},\ \{4,\dots,n\},\ \dots,\ \{n-1,n\},\ \{n\},\\
\{1,2,5,6,\dots,n\},\ \{1,2,6,\dots,n\},\ \dots,\ \{1,2,n-1,n\},\ \{1,2,n\},\ \{1,2\},\\
\{2,5,6,\dots,n\},\ \{2,6,\dots,n\},\ \dots,\ \{2,n-1,n\},\ \{2,n\},\ \{2\},\\
\{2,4,5,6,\dots,n\}.
\end{array}$$
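The update rules for the internal clusters can be exercised mechanically. The following Python sketch applies them to a family of clusters. The function name `extend_clusters` and the representation of clusters as frozensets are choices made for this illustration only. The starting family in the usage lines is the closed-form list read at $n = 4$, which is an assumption about how the pattern specializes for small $n$.

```python
def extend_clusters(family, n):
    """Apply the cluster update rules for the step N_{n-1} -> N_n.

    `family` is an iterable of sets of leaf labels (internal clusters of
    N_{n-1}); the result is the set of internal clusters of N_n, with
    repetitions collapsed.
    """
    updated = set()
    for old in family:
        new = set(old)
        # All three tests are made against the OLD cluster, as in the rules.
        if 1 in old:
            new.add(2)
        if 2 in old:
            new.add(n)
        if n - 1 in old:
            new.add(n)
        updated.add(frozenset(new))
    # Clusters of the five internal nodes added by the construction.
    for added in ({1, 2}, {2}, {2, n}, {n}, {n - 1, n}):
        updated.add(frozenset(added))
    return updated


# Illustrative starting family: the closed-form list read at n = 4 (assumption).
clusters_n4 = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4}, {4}, {1, 2}, {2}, {2, 4}]
clusters_n5 = extend_clusters(clusters_n4, 5)
```

Under this reading, `clusters_n5` coincides with the closed-form list evaluated at $n = 5$, which gives a small check of the induction step.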

Now, $N'_n$ is obtained from $N_n$ by interchanging 1 with $n$ and 2 with 3, and therefore the clusters
of its internal nodes can be obtained from the clusters of $N_n$ by applying this permutation. We
conclude that the family of internal clusters of $N'_n$ is (again, up to repetitions)

$$\begin{array}{l}
\{1,2,3,4,\dots,n\},\ \{1,2,3,4,\dots,n-1\},\ \{1,2,4,\dots,n-1\},\ \{1,4,\dots,n-1\},\ \dots,\ \{1,n-1\},\ \{1\},\\
\{1,3,5,6,\dots,n\},\ \{1,3,6,\dots,n\},\ \dots,\ \{1,3,n-1,n\},\ \{1,3,n\},\ \{3,n\},\\
\{1,3,5,6,\dots,n-1\},\ \{1,3,6,\dots,n-1\},\ \dots,\ \{1,3,n-1\},\ \{1,3\},\ \{3\},\\
\{1,3,4,5,6,\dots,n-1\}.
\end{array}$$

A simple inspection shows that only one cluster appears in both lists: the whole $\{1,\dots,n\}$.
(Indeed, all internal clusters of $N_n$ contain the leaf $n$, except $\{1,2\}$ and $\{2\}$. Now, on the one
hand, the latter are not internal clusters of $N'_n$ and, on the other hand, every internal cluster
in $N'_n$ containing $n$ also contains 1, 3, while no internal cluster of $N_n$ other than $\{1,2,3,\dots,n\}$
contains 1, 3.)
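The inspection of the two cluster lists can also be carried out by machine. The Python sketch below generates the internal clusters of $N_n$ from the closed-form pattern, applies the permutation that swaps 1 with $n$ and 2 with 3, and intersects the two families. The helper names are hypothetical. Reading the pattern as the tails $\{k,\dots,n\}$, the families $\{1,2\}\cup\{k,\dots,n\}$ and $\{2\}\cup\{k,\dots,n\}$ for $k \ge 5$ (with an empty tail allowed), and $\{2,4,\dots,n\}$ is an assumption of this sketch. It is restricted to $n \ge 5$, where every row of the pattern is non-degenerate.

```python
def clusters_of_Nn(n):
    """Internal clusters of N_n, read off the closed-form pattern (n >= 5)."""
    if n < 5:
        raise ValueError("this sketch only covers n >= 5")

    def tail(k):
        # The set {k, ..., n}; empty when k == n + 1.
        return set(range(k, n + 1))

    family = {frozenset(tail(k)) for k in range(1, n + 1)}
    family |= {frozenset({1, 2} | tail(k)) for k in range(5, n + 2)}
    family |= {frozenset({2} | tail(k)) for k in range(5, n + 2)}
    family.add(frozenset({2} | tail(4)))
    return family


def swap_labels(family, n):
    """Interchange the labels 1 <-> n and 2 <-> 3 in every cluster."""
    perm = {1: n, n: 1, 2: 3, 3: 2}
    return {frozenset(perm.get(leaf, leaf) for leaf in c) for c in family}


def common_clusters(n):
    """Clusters shared by N_n and the relabelled network N'_n."""
    original = clusters_of_Nn(n)
    return original & swap_labels(original, n)
```

For instance, `common_clusters(6)` evaluates to the one-element set containing only the full leaf set $\{1,\dots,6\}$.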

So, if a pair of internal nodes of $N_n$ and $N'_n$ have the same $\mu$-vector, their clusters must be
equal to $\{1,\dots,n\}$. Now, both $N_n$ and $N'_n$ have exactly two nodes with cluster $\{1,\dots,n\}$: the
root $r$ and its out-degree 3 child $a$. The $\mu$-vectors of $a$ or $r$ in $N_n$ are different from the $\mu$-vectors
of $a$ or $r$ in $N'_n$: in $N_n$, there is only one path from $r$ and from $a$ to 1, while in $N'_n$ it is clear that
there is more than one such path (the parent of 1 in $N'_n$ is a hybrid node, and its two parents
are descendants of both $a$ and $r$).

Therefore, $N_n$ and $N'_n$ have disjoint sets of $\mu$-vectors of internal nodes and their $\mu$-distance
is $10(n-2)$.

□ 

As discussed before, we can now define the normalized $\mu$-distance as

$$d'_\mu(N_1,N_2) = \frac{1}{10(n-2)}\, d_\mu(N_1,N_2)$$

if the involved networks have $n > 3$ leaves; if $n = 3$, $d_\mu(N_1,N_2)$ is divided instead by the
maximum value it can take for networks with three leaves. This way,
$d'_\mu$ takes values in the interval $[0,1]$, and there exist pairs of networks at maximum normalized
distance 1 for every number of leaves.

Example 4. Consider now the phylogenetic networks in Fig. 12. The two networks $N_1, N_2$ are
adapted from networks (a) and (b) in [12, Fig. 10] (where we have replaced the actual names
of the species by integers identifying them); we remark that the third one in the aforementioned
paper and figure is isomorphic to the first one. The phylogenetic tree $T$ depicted above is
the underlying tree from which both networks are obtained by adding edges corresponding to
horizontal gene transfer events. Both networks are binary and time consistent; however, the first
one is tree-child (hence tree-sibling) while the second one is not tree-child, but it is tree-sibling.
Also, the tree can be considered a binary tree-sibling time consistent phylogenetic network.
Hence, we can compute their $\mu$-distances, obtaining that the two networks are more similar to
the underlying phylogenetic tree than to each other:

$$\begin{array}{ll}
d_\mu(T,N_1) = 22, & d'_\mu(T,N_1) \approx 0.169,\\
d_\mu(T,N_2) = 32, & d'_\mu(T,N_2) \approx 0.246,\\
d_\mu(N_1,N_2) = 38, & d'_\mu(N_1,N_2) \approx 0.292.
\end{array}$$

Figure 12: Tree $T$ (above) and networks $N_1$ (middle), $N_2$ (below) from [12, Fig. 10].

Figure 13: Non tree-sibling, semi-binary time consistent networks with the same
$\mu$-representation.
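The normalization used in Example 4 can be sketched in a few lines of Python. The function name is hypothetical, and the sketch only covers networks with more than three leaves, where the $\mu$-distance is divided by $10(n-2)$. The number of leaves $n = 15$ used in the usage lines is not stated in the example: it is inferred here from the reported values, since $22/130$, $32/130$ and $38/130$ round to the three normalized distances given.

```python
def normalized_mu_distance(d_mu, n):
    """Normalize a mu-distance between two networks with n > 3 leaves.

    The divisor 10 * (n - 2) is the maximum mu-distance for n > 3 leaves;
    the case n = 3 uses a different constant and is not handled here.
    """
    if n <= 3:
        raise ValueError("this sketch only covers n > 3 leaves")
    return d_mu / (10 * (n - 2))


# Values of Example 4, with n = 15 leaves as an inferred assumption.
n_leaves = 15
normalized = [round(normalized_mu_distance(d, n_leaves), 3) for d in (22, 32, 38)]
```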

6 Computational aspects 

We have already mentioned in Section 3 that the $\mu$-representation of a phylogenetic network
can be efficiently computed by means of a simple bottom-up technique. Indeed, if we define the
height of a node as the length of the longest path starting in this node, we get a stratification
of nodes. The nodes with height 0 are the leaves, and their $\mu$-vectors are trivially computed.
Assuming that we have computed the $\mu$-vectors of nodes up to a given height $h$, we can compute
the $\mu$-vector of a node at height $h+1$ by simply adding up the $\mu$-vectors of its children, which
are already computed. If the network has $n$ leaves, $m$ nodes, and the out-degree of tree nodes is
bounded by $k < m$, the cost of this computation is $O(kmn) = O(m^2 n)$. In order to improve the
efficiency of the computations of distances below, the $\mu$-representation of the network is stored
with the $\mu$-vectors sorted in any total order, for instance the lexicographic order; note that the
computational cost of sorting the $\mu$-representation is $O(nm \log m)$; hence, the total cost of the
computation and sorting is still $O(m^2 n)$.
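The bottom-up computation can be illustrated with a short Python sketch. It is not the code of the Perl package. The encoding of a network as a dictionary from each internal node to the list of its children, the function name, and the toy network are assumptions made for this illustration. A memoized post-order traversal is used in place of an explicit stratification by height: the effect is the same, since the $\mu$-vector of a node is computed only after those of its children. All nodes, leaves included, contribute a vector to the representation here.

```python
def mu_representation(children, leaves):
    """Return the sorted list of mu-vectors of all nodes of a network.

    `children` maps each internal node to the list of its children;
    `leaves` lists the leaf labels in the order used for the coordinates.
    """
    position = {leaf: i for i, leaf in enumerate(leaves)}
    n = len(leaves)
    mu = {}

    def mu_vector(v):
        if v not in mu:
            vec = [0] * n
            if v in position:
                # A leaf has exactly one (trivial) path to itself.
                vec[position[v]] = 1
            else:
                # An internal node adds up the mu-vectors of its children.
                for child in children[v]:
                    for i, paths in enumerate(mu_vector(child)):
                        vec[i] += paths
            mu[v] = tuple(vec)
        return mu[v]

    for v in set(children) | set(leaves):
        mu_vector(v)
    # Tuples compare lexicographically, which gives the total order needed.
    return sorted(mu.values())


# Toy network: root r, tree nodes a and b, hybrid node h with parents a and b.
toy_children = {"r": ["a", "b"], "a": [1, "h"], "b": ["h", 3], "h": [2]}
toy_rep = mu_representation(toy_children, [1, 2, 3])
```

In the toy network the root reaches leaf 2 through both `a` and `b`, so its $\mu$-vector is $(1,2,1)$.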

Also, given two networks and their $\mu$-representations, their $\mu$-distance can be computed
efficiently. We can assume that the $\mu$-vectors of each network are sorted as explained above.
Then, a simultaneous traversal of the $\mu$-representations of both networks allows the computation
of their $\mu$-distance in $O(n(m_1 + m_2))$, where $m_1, m_2$ are the numbers of nodes of each of the
networks.
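The simultaneous traversal can be sketched as a merge-style scan over the two sorted lists, counting the vectors that cannot be matched, that is, the size of the symmetric difference of the two multisets. The following Python sketch is an illustration under that reading, not the code of the Perl package. The function name and the two hard-coded representations (a three-leaf tree and a three-leaf network with one hybrid node) are hypothetical.

```python
def mu_distance(rep1, rep2):
    """Size of the symmetric difference of two sorted multisets of mu-vectors."""
    i = j = 0
    unmatched = 0
    while i < len(rep1) and j < len(rep2):
        if rep1[i] == rep2[j]:
            # Same vector in both lists: match one copy from each.
            i += 1
            j += 1
        elif rep1[i] < rep2[j]:
            # rep1[i] cannot occur later in rep2, which is sorted.
            unmatched += 1
            i += 1
        else:
            unmatched += 1
            j += 1
    # Whatever remains in either list has no partner.
    return unmatched + (len(rep1) - i) + (len(rep2) - j)


# Sorted mu-representations of a toy tree and a toy network on leaves 1, 2, 3.
tree_rep = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
network_rep = [(0, 0, 1), (0, 1, 0), (0, 1, 0), (0, 1, 1),
               (1, 0, 0), (1, 1, 0), (1, 2, 1)]
toy_distance = mu_distance(tree_rep, network_rep)
```

Each comparison of two vectors costs $O(n)$ and every step advances at least one index, which is consistent with the $O(n(m_1 + m_2))$ bound.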

We have implemented the computation of the $\mu$-representation of networks and the $\mu$-distance
between them in a Perl package [5] , part of the BioPerl bundle [UJ . 

Note also that the reduction procedures introduced in an earlier section allow for the construction of
all semi-binary tree-sibling time consistent phylogenetic networks on a given set of taxa. Indeed,
as we have already proved, each such network can be reduced to a tree with two leaves by
recursively applying the reduction procedures. Since all these procedures are reversible, we can
effectively construct all networks. However, the computational cost of this construction is high,
since for obtaining all the sbTSTC networks over a set $S$ of leaves with $h$ hybrid nodes we need
to recursively construct, first, all the networks with set of leaves $S' \subset S$, $|S'| = |S| - 1$ and $h$
hybrid nodes, and, second, all those with set of leaves $S$ and $h - 1$ hybrid nodes.

The aforementioned Perl package contains a module to construct all tree-child phylogenetic 
networks on a given set of leaves. We are working on a module that generates all sbTSTC 
phylogenetic networks, which will be incorporated in the next release of the package. 

7 Counterexamples 

When we defined the class of sbTSTC phylogenetic networks, we remarked that the
conditions imposed are necessary in order to single out networks by means of their $\mu$-representations.
In this section we give examples of pairs of more general networks that are non-isomorphic but have the
same $\mu$-representation.

In Fig. 13 we give an example of a pair of semi-binary time consistent networks not satisfying
the tree-sibling property, and having the same $\mu$-representation.

Consider the phylogenetic networks depicted in Fig. 14. They are tree-sibling, binary, and the
single child of each hybrid node is a tree node; however, they do not satisfy the time consistency
condition. As can easily be checked, both networks have the same $\mu$-representation.

Figure 14: Non time consistent tree-sibling networks with the same $\mu$-representation.

Figure 15: Non semi-binary, tree-sibling, time consistent networks with the same
$\mu$-representation.

Semi-binarity is also a necessary condition: the first network in Fig. 15 is time
consistent and tree-sibling, but not semi-binary, and has the same $\mu$-representation as the second
one, which is an sbTSTC network.

To conclude this series of counterexamples, the condition that the single child of a
hybrid node is a tree node is also necessary, as the networks in Fig. 16, both with the same
$\mu$-representation, show.

8 Conclusions 

While there exist in the literature some algorithms to reconstruct sbTSTC phylogenetic networks
from biological sequences, no distance metric was known in this class that is both mathematically
consistent and computationally efficient. The $\mu$-distance we have defined fulfills these two
requirements, and is already implemented in a package included in the BioPerl bundle.

This $\mu$-distance is based on the $\mu$-representation of networks: a multiset of vectors of natural
numbers, each of them associated with a node. This $\mu$-representation could also be used to define
alignments between phylogenetic networks [4, Sec. VI], which are useful in order to display at a
glance the differences between alternative evolutionary histories of a set of species. Some results
in this direction will be published shortly elsewhere.

As a by-product, we have also obtained a procedure to generate all the sbTSTC networks on
a given set of taxa up to isomorphism. We are working on an efficient implementation of their
generation, in order to include it in a forthcoming release of BioPerl.